[MFMA] Switch between MFMA types #352
Conversation
First, #251 needs to be merged.
Force-pushed from 15c204d to 1b3f6a0
@@ -0,0 +1,251 @@
#include "mlir/IR/TypeUtilities.h"
I've separated this code out of the common AccelerateMatmul pass so I can add an additional option to it.
Do you think it is OK to do this in this PR, or is it better to split it into a separate one?
@@ -309,14 +310,15 @@ def make_hash(fn, arch, env_vars, **kwargs):
    num_ctas = kwargs.get("num_ctas", 1)
    num_stages = kwargs.get("num_stages", 3)
    waves_per_eu = kwargs.get("waves_per_eu", 0)
    matrix_instr_nonkdim = kwargs.get("matrix_instr_nonkdim", 0);
@alefimov-amd @oplavsic After dealing with tuning parameters for a while, I'm wondering why we need to add new tuning parameters explicitly instead of treating them as constants?
The only benefit of adding them explicitly is that we can still tune them even when they are not explicitly defined as kernel arguments.
Do we have an example of such a use?
Maybe I did not understand your idea correctly, but I feel this could be more error-prone.
P.S. I also feel that adding tons of parameters is not the best way, and we probably need to find a more elegant way to add them.
When I added pre_load_v as a tuning parameter, I just added it to the autotuner configs and as a tl.constexpr kernel argument. Nothing changed in compiler.py, and it is treated like BLOCK_M rather than num_warps.
It seems that there are two kinds of kernel arguments: meta-parameters like BLOCK_M, and compilation options like num_warps, according to the explanation here: https://github.com/ROCmSoftwarePlatform/triton/blob/461d72e5477d1659dc05e10060db4db3907c958f/python/tutorials/03-matrix-multiplication.py#L162
The only difference between the two is whether we can set default values for them. For meta-parameters, if nothing is set, there will be an error like missing 1 required positional argument: 'pre_load_v'.
P.S. I think both kinds are compilation options, since the kernel needs to be recompiled whenever a value changes.
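To make this concrete, here is a minimal sketch of such a frontend-only tuning parameter, assuming the standard triton.autotune / triton.Config API (the kernel body and the tuning key are placeholders; only pre_load_v comes from this discussion):

import triton
import triton.language as tl

# pre_load_v lives entirely in the Python frontend: the autotuner
# substitutes it like any other meta-parameter, each distinct value
# triggers a recompile, and compiler.py needs no changes.
@triton.autotune(
    configs=[
        triton.Config({'BLOCK_M': 64, 'pre_load_v': True}, num_warps=4),
        triton.Config({'BLOCK_M': 64, 'pre_load_v': False}, num_warps=4),
    ],
    key=['M'],
)
@triton.jit
def kernel(x_ptr, M, BLOCK_M: tl.constexpr, pre_load_v: tl.constexpr):
    ...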
This is an interesting approach; it is definitely worth trying.
I have only one concern about it: the user has to declare this constant themselves, and if they make a mistake, the mistake will not be reported.
For example, we use the MATRIX_INSTR_NONKDIM constant to control MFMA behavior, and a user can write this code:

@triton.jit
def kernel(MTRIX_INSTR_NONKDIM: tl.constexpr):  # note the misspelled name
    ...

kernel[grid](MTRIX_INSTR_NONKDIM = 16)

This code is correct in terms of the language, but it does not do what we want, and the mistake is not reported.
I am thinking that maybe we can introduce an additional decorator to pass AMD-specific options to a kernel without messing with the upstream interfaces.
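For illustration, a hypothetical decorator along those lines might look like the sketch below; amd_options and the _amd_options attribute are invented names, not an existing API:

import triton

def amd_options(**options):
    # Stash AMD-specific options on the jitted function so a downstream
    # compile step could read them, without touching the upstream
    # kernel-launch interface.
    def wrapper(jit_fn):
        jit_fn._amd_options = dict(options)
        return jit_fn
    return wrapper

@amd_options(matrix_instr_nonkdim=16, waves_per_eu=2)
@triton.jit
def kernel(x_ptr):
    ...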
I see your point. MATRIX_INSTR_NONKDIM and waves_per_eu are needed explicitly in the lowering passes. However, pre_load_v and BLOCK_M are only needed in the Python-level frontend, so the compile() function doesn't care about them.
Upstream has already added a lot of Hopper-specific parameters to the list, which is far from clean.
I agree with you that we should have a "bag" for AMD options, and we should also suggest that upstream put all these NVIDIA parameters into another bag.
Force-pushed from 1b3f6a0 to f43b54e
Force-pushed from f43b54e to fcdb690
Option<"matrixCoreVersion", "matrix-core-version", | ||
"int32_t", /*default*/"0", | ||
"device matrix core version">, | ||
Option<"matrixInstructionSize", "matrix-instructio-size", |
typo
"device matrix core version">, | ||
Option<"matrixInstructionSize", "matrix-instructio-size", | ||
"int32_t", /*default*/"0", | ||
"enforce matrix intrucion MN size"> |
typo
// layout are 32 apart: [[0 0 0 0 32 32 32 32 ...] [1 1 1 1 33 33 33 33
// ...] ...]. for mfma 16x16 adjacent threads in y dimension in
// transposed MFMA layout are 16 apart: [[0 0 0 0 16 16 16 16 32 32 32
// 32 ...] [1 1 1 1 33 33 33 33 ...] ...].
Is it possible to get the waveSize from the gpu dialect or mfma layout?
Unfortunately, no...
However, the MFMA layout appears in the IR only if the target is a CDNA architecture, which supports only wave64 mode. I think it should be safe to use a constant here.
In my opinion, we should report the MFMA layout on a non-CDNA GPU as an error.
If you really want, it is possible to infer waveSize from mfmaLayout by computing the product of mfmaLayout.threadsPerWarp. But that is a little "ugly" in my opinion.
And you can also get it from the GPU dialect, as is done here: https://github.com/ROCmSoftwarePlatform/triton/blob/4d539d7dae055bb6b8dbb1b2b380118333250f15/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L589
This PR introduces the matrix_instr_nonkdim flag to switch between MFMA 16 and MFMA 32.
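As a rough usage sketch, assuming the kwarg plumbing added in this PR (the kernel name, grid, and other arguments are placeholders), the flag is passed at launch time the same way as num_warps:

# matrix_instr_nonkdim=16 selects MFMA 16x16, 32 selects MFMA 32x32,
# and 0 (the default) leaves the choice to the existing heuristic.
matmul_kernel[grid](
    a_ptr, b_ptr, c_ptr, M, N, K,
    BLOCK_M=128, BLOCK_N=128, BLOCK_K=64,
    matrix_instr_nonkdim=16,
)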
LGTM
Some notes:
- Documentation about AMD MFMA instruction usage is planned for a future PR
- [GEMM] [Tuning] Parameterize mfma type #366 is needed for the GEMM tuning script to use the correct MFMA type
Force-pushed from fcdb690 to c0a0664